Building a text collection for Urdu information retrieval

Authors

Abstract

Urdu is a widely spoken language in the Indian subcontinent, with over 300 million speakers worldwide. However, linguistic advancements for Urdu are rare compared to those for other European and Asian languages. Therefore, following Text Retrieval Conference (TREC) standards, we attempted to construct an extensive text collection of 85 304 documents from diverse categories, covering 52 topics with relevance judgment sets at a pool depth of 100. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other tasks with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
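The abstract does not spell out how the judgment pools were formed. In TREC-style evaluations, a pool depth of 100 usually means that, for each topic, the top 100 documents from every contributing retrieval run are merged into one set for assessors to judge. The following is a minimal sketch of that general idea only, not the authors' procedure; the function name `build_pools`, the run layout, and the toy document IDs are all hypothetical.

```python
from collections import defaultdict


def build_pools(runs, depth=100):
    """Merge the top-`depth` documents of every run into one pool per topic.

    `runs` is a list of dicts, each mapping a topic ID to that run's ranked
    list of document IDs (best first). This layout is an assumption made
    for illustration.
    """
    pools = defaultdict(set)
    for run in runs:
        for topic_id, ranking in run.items():
            # Only the first `depth` documents of each ranking enter the pool.
            pools[topic_id].update(ranking[:depth])
    return dict(pools)


# Toy data: two hypothetical runs over two topics.
run_a = {"T1": ["d3", "d1", "d7"], "T2": ["d2", "d9"]}
run_b = {"T1": ["d1", "d4", "d8"], "T2": ["d9", "d5"]}

# A depth of 2 is used here only to keep the example readable.
pools = build_pools([run_a, run_b], depth=2)
for topic_id in sorted(pools):
    print(topic_id, sorted(pools[topic_id]))
```

Documents that no run ranks within the chosen depth never reach the assessors and are conventionally treated as non-relevant, which is why the pool depth is reported alongside the judgments.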

Similar resources

Building a Heterogeneous Information Retrieval Collection of Printed Arabic Documents

This paper describes the development of an Arabic document image collection containing 34,651 documents from 1,378 different books and 25 topics with their relevance judgments. The books from which the collection is obtained are part of a larger collection of 75,000 books being scanned for archival and retrieval at the Bibliotheca Alexandrina (BA). The documents in the collection vary widely in ...

Building a test collection for speech-driven web retrieval

This paper describes a test collection (benchmark data) for retrieval systems driven by spoken queries. This collection was produced in the subtask of the NTCIR-3 Web retrieval task, which was performed in a TREC-style evaluation workshop. The search topics and document collection for the Web retrieval task were used to produce spoken queries and language models for speech recognition, respecti...

Building A Large Thesaurus For Information Retrieval

Information retrieval systems that support searching of large textual databases are typically accessed by trained search intermediaries who provide assistance to end users in bridging the gap between the languages of authors and inquirers. We are building a thesaurus in the form of a large semantic network to support interactive query expansion and search by end users. Our lexicon is being bui...

A Text Clustering Framework for Information Retrieval

Text-mining methods have become a key feature for homeland-security technologies, as they can help explore effectively increasing masses of digital documents in the search for relevant information. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The overall paradigm is hybrid because it combines pattern-recogniti...

Building a Domain-specific Document Collection for Evaluating Metadata Effects on Information Retrieval

This paper describes the development of a structured document collection containing user-generated text and numerical metadata for exploring the exploitation of metadata in information retrieval (IR). The collection consists of more than 61,000 documents extracted from YouTube video pages on basketball in general and NBA (National Basketball Association) in particular, together with a set of 40...

Journal

Journal title: ETRI Journal

Year: 2021

ISSN: 1225-6463, 2233-7326

DOI: https://doi.org/10.4218/etrij.2019-0458